Recommender
This tutorial is based on the work of Marcel Caraciolo at http://aimotion.blogspot.com/2012/08/introduction-to-recommendations-with.html
Our goal is to calculate how similar pairs of movies are, so that we can recommend movies similar to the ones you liked. We will use the correlation between rating vectors as the similarity measure.
We are going to work with a dataset of movie ratings from http://grouplens.org/datasets/movielens/. For this task we will use the MovieLens dataset with 100,000 ratings from 1000 users on 1700 movies (you can download it at http://www.grouplens.org/node/73).
So the first step is to get our ratings file, which has three columns we care about: (user, movie, rating).
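Each line of u.data is tab-separated with four fields: user id, movie id, rating, and a timestamp (which we ignore). A representative line in that format looks like this:
196	242	3	881250949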
You want to compute how similar pairs of movies are, so that if someone watches The Matrix, you can recommend movies like Blade Runner. So how should you define the similarity between two movies?
One possibility is to compute their correlation. The basic idea: for every pair of movies A and B, find all the people who rated both A and B. Use these ratings to form a Movie A vector and a Movie B vector, then calculate the correlation between those two vectors. When someone watches a movie, you can then recommend the movies most correlated with it.
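To make this concrete, here is a tiny self-contained sketch using toy ratings (not from the dataset): three users rated both movie A and movie B, and we compute the Pearson correlation of the two rating vectors from running sums, the same quantities the reducers below will accumulate.
In [ ]:
from math import sqrt

# Toy co-ratings: three users who rated both movie A and movie B
a = [4.0, 3.0, 5.0]  # ratings for movie A
b = [5.0, 3.0, 4.0]  # ratings for movie B

n = len(a)
sum_x, sum_y = sum(a), sum(b)
sum_xx = sum(x * x for x in a)
sum_yy = sum(y * y for y in b)
sum_xy = sum(x * y for x, y in zip(a, b))

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt(n * sum_xx - sum_x ** 2) * sqrt(n * sum_yy - sum_y ** 2)
print numerator / denominator  # 0.5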
So let's divide and conquer. Our first task: for each user, the mapper emits a row containing their 'postings' (item, rating pairs), and the reducer emits the user's rating sum and count for use in later steps.
In [2]:
def group_by_user_rating(self, key, line):
"""
Emit the user_id and group by their ratings (item and rating)
17 70,3
35 21,1
49 19,2
49 21,1
49 70,4
87 19,1
87 21,2
98 19,2
"""
    user_id, item_id, rating = line.split('\t')[:3]  # u.data is tab-separated: user, item, rating, timestamp
    yield user_id, (item_id, float(rating))
def count_ratings_users_freq(self, user_id, values):
"""
For each user, emit a row containing their "postings"
(item,rating pairs)
    Also emit the user's rating sum and count for use in later steps.
17 1,3,(70,3)
35 1,1,(21,1)
49 3,7,(19,2 21,1 70,4)
87 2,3,(19,1 21,2)
98 1,2,(19,2)
"""
item_count = 0
item_sum = 0
final = []
for item_id, rating in values:
item_count += 1
item_sum += rating
final.append((item_id, rating))
yield user_id, (item_count, item_sum, final)
Before using these rating pairs to calculate correlation, let's see how we can compute it. The co-ratings of a pair of movies can be viewed as two vectors, so we can use linear algebra: dot products and norms give us the quantities we need, such as the length of each vector and the sum over all elements in each vector. By representing them as matrices, we can perform several operations on those movies.
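As a quick aside (not part of the mrjob pipeline), NumPy makes the vector view explicit: the accumulators in the reducer below are just dot products and element sums of the two rating vectors. Using the same toy vectors as before:
In [ ]:
import numpy as np

x = np.array([4.0, 3.0, 5.0])  # movie A ratings from the co-rating users
y = np.array([5.0, 3.0, 4.0])  # movie B ratings from the same users

print np.dot(x, x), np.dot(y, y)  # sum_xx, sum_yy
print np.dot(x, y)                # sum_xy
print x.sum(), y.sum()            # sum_x, sum_y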
In [3]:
def pairwise_items(self, user_id, values):
'''
    The output drops the user from the key entirely; instead it emits
    the pair of items as the key:
19,21 2,1
19,70 2,4
21,70 1,4
19,21 1,2
    This mapper is the main performance bottleneck. One improvement
    would be to write a Java combiner to aggregate the outputs by key
    before writing to HDFS; another would be to use a vector format
    and SequenceFiles instead of streaming text for the matrix data.
'''
item_count, item_sum, ratings = values
#print item_count, item_sum, [r for r in combinations(ratings, 2)]
#bottleneck at combinations
for item1, item2 in combinations(ratings, 2):
yield (item1[0], item2[0]), \
(item1[1], item2[1])
def calculate_similarity(self, pair_key, lines):
'''
    Sum the components of each co-rating pair across all users who rated
    both item x and item y, then calculate the pairwise Pearson similarity
    and the co-rating count. The similarities are normalized to the [0,1]
    scale because we do a numerical sort.
19,21 0.4,2
21,19 0.4,2
19,70 0.6,1
70,19 0.6,1
21,70 0.1,1
70,21 0.1,1
'''
sum_xx, sum_xy, sum_yy, sum_x, sum_y, n = (0.0, 0.0, 0.0, 0.0, 0.0, 0)
    item_xname, item_yname = pair_key
for item_x, item_y in lines:
sum_xx += item_x * item_x
sum_yy += item_y * item_y
sum_xy += item_x * item_y
sum_y += item_y
sum_x += item_x
n += 1
similarity = normalized_correlation(n, sum_xy, sum_x, sum_y, \
sum_xx, sum_yy)
yield (item_xname, item_yname), (similarity, n)
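To see what pairwise_items emits for a single user, we can expand user 49's postings from the docstring above with itertools.combinations:
In [ ]:
from itertools import combinations

ratings = [('19', 2.0), ('21', 1.0), ('70', 4.0)]  # user 49's postings
for item1, item2 in combinations(ratings, 2):
    print (item1[0], item2[0]), (item1[1], item2[1])
# ('19', '21') (2.0, 1.0)
# ('19', '70') (2.0, 4.0)
# ('21', '70') (1.0, 4.0)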
To summarize, for each pair of movies, calculate_similarity computes the number of people who rated both movies (n), the sum over all elements in each ratings vector (sum_x, sum_y), the squared sum of each vector (sum_xx, sum_yy), and the dot product of the two vectors (sum_xy). With these we can calculate the correlation between the movies. The Pearson correlation can be expressed as:
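$$\mathrm{corr}(X,Y) = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\;\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$$
Here $n$ is the co-rating count and the sums map directly onto the reducer's accumulators: $\sum x_i y_i$ is sum_xy, $\sum x_i^2$ is sum_xx, $\sum x_i$ is sum_x, and likewise for $y$.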
So that's it! The last step of the job sorts the top-correlated items for each item and prints them to the output.
In [ ]:
def calculate_ranking(self, item_keys, values):
'''
Emit items with similarity in key for ranking:
19,0.4 70,1
19,0.6 21,2
21,0.6 19,2
21,0.9 70,1
70,0.4 19,1
70,0.9 21,1
'''
similarity, n = values
item_x, item_y = item_keys
if int(n) > 0:
yield (item_x, similarity), (item_y, n)
def top_similar_items(self, key_sim, similar_ns):
'''
    For each item, emit the K closest items to a semicolon-separated file:
De La Soul;A Tribe Called Quest;0.6;1
De La Soul;2Pac;0.4;2
'''
item_x, similarity = key_sim
for item_y, n in similar_ns:
print '%s;%s;%f;%d' % (item_x, item_y, similarity, n)
All of it in one file: MovieSimilarities.py
In [ ]:
# %load code/MovieSimilarities.py
'''
Given a dataset of movies and their ratings by different
users, how can we compute the similarity between pairs of
movies?
This module computes similarities between movies
by representing each movie as a vector of ratings and
computing similarity scores over these vectors.
Copied from:
https://github.com/marcelcaraciolo/recsys-mapreduce-mrjob/blob/master/moviesSimilarities.py
'''
__author__ = 'Marcel Caraciolo <caraciol@gmail.com>'
from mrjob.job import MRJob
from metrics import correlation
from metrics import cosine, regularized_correlation
from math import sqrt
try:
from itertools import combinations
except ImportError:
from metrics import combinations
PRIOR_COUNT = 10
PRIOR_CORRELATION = 0
class SemicolonValueProtocol(object):
# don't need to implement read() since we aren't using it
def write(self, key, values):
return ';'.join(str(v) for v in values)
class MoviesSimilarities(MRJob):
OUTPUT_PROTOCOL = SemicolonValueProtocol
def steps(self):
return [
self.mr(mapper=self.group_by_user_rating,
reducer=self.count_ratings_users_freq),
self.mr(mapper=self.pairwise_items,
reducer=self.calculate_similarity),
self.mr(mapper=self.calculate_ranking,
reducer=self.top_similar_items)]
def group_by_user_rating(self, key, line):
"""
Emit the user_id and group by their ratings (item and rating)
17 70,3
35 21,1
49 19,2
49 21,1
49 70,4
87 19,1
87 21,2
98 19,2
"""
        user_id, item_id, rating = line.split('\t')[:3]  # u.data is tab-separated: user, item, rating, timestamp
        yield user_id, (item_id, float(rating))
def count_ratings_users_freq(self, user_id, values):
"""
For each user, emit a row containing their "postings"
(item,rating pairs)
        Also emit the user's rating sum and count for use in later steps.
17 1,3,(70,3)
35 1,1,(21,1)
49 3,7,(19,2 21,1 70,4)
87 2,3,(19,1 21,2)
98 1,2,(19,2)
"""
item_count = 0
item_sum = 0
final = []
for item_id, rating in values:
item_count += 1
item_sum += rating
final.append((item_id, rating))
yield user_id, (item_count, item_sum, final)
def pairwise_items(self, user_id, values):
'''
        The output drops the user from the key entirely; instead it emits
        the pair of items as the key:
19,21 2,1
19,70 2,4
21,70 1,4
19,21 1,2
        This mapper is the main performance bottleneck. One improvement
        would be to write a Java combiner to aggregate the outputs by key
        before writing to HDFS; another would be to use a vector format
        and SequenceFiles instead of streaming text for the matrix data.
'''
item_count, item_sum, ratings = values
#print item_count, item_sum, [r for r in combinations(ratings, 2)]
#bottleneck at combinations
for item1, item2 in combinations(ratings, 2):
yield (item1[0], item2[0]), \
(item1[1], item2[1])
def calculate_similarity(self, pair_key, lines):
'''
        Sum the components of each co-rating pair across all users who rated
        both item x and item y, then calculate the pairwise Pearson similarity
        and the co-rating count. The similarities are normalized to the [0,1]
        scale because we do a numerical sort.
19,21 0.4,2
21,19 0.4,2
19,70 0.6,1
70,19 0.6,1
21,70 0.1,1
70,21 0.1,1
'''
sum_xx, sum_xy, sum_yy, sum_x, sum_y, n = (0.0, 0.0, 0.0, 0.0, 0.0, 0)
        item_xname, item_yname = pair_key
for item_x, item_y in lines:
sum_xx += item_x * item_x
sum_yy += item_y * item_y
sum_xy += item_x * item_y
sum_y += item_y
sum_x += item_x
n += 1
corr_sim = correlation(n, sum_xy, sum_x, \
sum_y, sum_xx, sum_yy)
reg_corr_sim = regularized_correlation(n, sum_xy, sum_x, \
sum_y, sum_xx, sum_yy, PRIOR_COUNT, PRIOR_CORRELATION)
cos_sim = cosine(sum_xy, sqrt(sum_xx), sqrt(sum_yy))
jaccard_sim = 0.0
yield (item_xname, item_yname), (corr_sim, \
cos_sim, reg_corr_sim, jaccard_sim, n)
def calculate_ranking(self, item_keys, values):
'''
Emit items with similarity in key for ranking:
19,0.4 70,1
19,0.6 21,2
21,0.6 19,2
21,0.9 70,1
70,0.4 19,1
70,0.9 21,1
'''
corr_sim, cos_sim, reg_corr_sim, jaccard_sim, n = values
item_x, item_y = item_keys
if int(n) > 0:
yield (item_x, corr_sim, cos_sim, reg_corr_sim, jaccard_sim), \
(item_y, n)
def top_similar_items(self, key_sim, similar_ns):
'''
        For each item, emit the K closest items to a semicolon-separated file:
De La Soul;A Tribe Called Quest;0.6;1
De La Soul;2Pac;0.4;2
'''
item_x, corr_sim, cos_sim, reg_corr_sim, jaccard_sim = key_sim
for item_y, n in similar_ns:
yield None, (item_x, item_y, corr_sim, cos_sim, reg_corr_sim,
jaccard_sim, n)
if __name__ == '__main__':
MoviesSimilarities.run()
In [4]:
!python code/MovieSimilarities.py data/ml-100k/ml-100k/u.data > data/output.csv
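To peek at the first few result lines (assuming the run above wrote the semicolon-separated output to data/output.csv):
In [ ]:
for line in open('data/output.csv').readlines()[:5]:
    print line.strip()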